
DBSCAN
The labelled faces dataset of scikit-learn contains grey-scale images of 62 different famous personalities from politics, sports and entertainment. In this exercise we assume that there are no target labels, i.e. the names of the persons are unknown. We want to find a method to cluster similar images. This can be done with a dimensionality-reduction algorithm such as PCA for feature generation, followed by a clustering algorithm such as DBSCAN.
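Before working on the faces, the two-stage pipeline can be sketched on synthetic data. This is a minimal illustration, not the exercise itself: make_blobs stands in for the face images, and eps/min_samples are illustrative values chosen for this toy data.

```python
# Minimal sketch of the PCA -> DBSCAN pipeline on synthetic data
# (make_blobs stands in for the face images; parameters are illustrative).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# 300 samples in 50 dimensions, 3 well-separated groups
X, _ = make_blobs(n_samples=300, n_features=50, centers=3, random_state=0)

# reduce to a few components, then cluster the reduced representation
X_red = PCA(n_components=5, random_state=0).fit_transform(X)
labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(X_red)

# DBSCAN labels clusters 0..k-1 and marks noise points with -1
print("clusters found:", len(set(labels) - {-1}))
```

The same pattern — fit PCA, transform, then fit_predict with DBSCAN — is what the notebook applies to the face images below, where choosing eps is much harder.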
%matplotlib inline
from IPython.display import set_matplotlib_formats, display
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from cycler import cycler
plt.rcParams['image.cmap'] = "gray"
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
Open the Jupyter notebook DBSCAN_DetectSimilarFaces.ipynb and have a look at the
first few faces of the dataset. Not every person is represented equally frequently in this
unbalanced dataset; for classification we would have to take this into account. We extract
the first 50 images of each person and put them into a flat array called X_people. The
corresponding targets (y-values, names) are stored in the y_people array.
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people
people = fetch_lfw_people(min_faces_per_person=20, resize=2)
image_shape = people.images[0].shape
fig, axes = plt.subplots(2, 5, figsize=(15, 8),
                         subplot_kw={'xticks': (), 'yticks': ()})
for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])
np.shape(people.images)
(3023, 250, 188)
left = 5
top = 5
right = image_shape[1]-left
bottom = image_shape[0]-top
import PIL
for img in people.images[0:3, :, :]:
    #pil_img = PIL.Image.fromarray(np.uint8(img*255))
    pil_img = PIL.Image.fromarray(img)
    plt.figure()
    plt.imshow(np.array(pil_img))
    pil_img = pil_img.crop((left, top, right, bottom))
    plt.figure()
    plt.imshow(np.array(pil_img))
np.shape(np.array(pil_img))
(240, 178)
print("people.images.shape: {}".format(people.images.shape))
print("Number of classes: {}".format(len(people.target_names)))
people.images.shape: (3023, 250, 188)
Number of classes: 62
people.target_names
array(['Alejandro Toledo', 'Alvaro Uribe', 'Amelie Mauresmo',
'Andre Agassi', 'Angelina Jolie', 'Ariel Sharon',
'Arnold Schwarzenegger', 'Atal Bihari Vajpayee', 'Bill Clinton',
'Carlos Menem', 'Colin Powell', 'David Beckham', 'Donald Rumsfeld',
'George Robertson', 'George W Bush', 'Gerhard Schroeder',
'Gloria Macapagal Arroyo', 'Gray Davis', 'Guillermo Coria',
'Hamid Karzai', 'Hans Blix', 'Hugo Chavez', 'Igor Ivanov',
'Jack Straw', 'Jacques Chirac', 'Jean Chretien',
'Jennifer Aniston', 'Jennifer Capriati', 'Jennifer Lopez',
'Jeremy Greenstock', 'Jiang Zemin', 'John Ashcroft',
'John Negroponte', 'Jose Maria Aznar', 'Juan Carlos Ferrero',
'Junichiro Koizumi', 'Kofi Annan', 'Laura Bush',
'Lindsay Davenport', 'Lleyton Hewitt', 'Luiz Inacio Lula da Silva',
'Mahmoud Abbas', 'Megawati Sukarnoputri', 'Michael Bloomberg',
'Naomi Watts', 'Nestor Kirchner', 'Paul Bremer', 'Pete Sampras',
'Recep Tayyip Erdogan', 'Ricardo Lagos', 'Roh Moo-hyun',
'Rudolph Giuliani', 'Saddam Hussein', 'Serena Williams',
'Silvio Berlusconi', 'Tiger Woods', 'Tom Daschle', 'Tom Ridge',
'Tony Blair', 'Vicente Fox', 'Vladimir Putin', 'Winona Ryder'],
dtype='<U25')
# count how often each target appears
counts = np.bincount(people.target)
# print counts next to target names:
for i, (count, name) in enumerate(zip(counts, people.target_names)):
    print("{0:25} {1:3}".format(name, count), end='   ')
    if (i + 1) % 3 == 0:
        print()
Alejandro Toledo           39   Alvaro Uribe               35   Amelie Mauresmo            21
Andre Agassi               36   Angelina Jolie             20   Ariel Sharon               77
Arnold Schwarzenegger      42   Atal Bihari Vajpayee       24   Bill Clinton               29
Carlos Menem               21   Colin Powell              236   David Beckham              31
Donald Rumsfeld           121   George Robertson           22   George W Bush             530
Gerhard Schroeder         109   Gloria Macapagal Arroyo    44   Gray Davis                 26
Guillermo Coria            30   Hamid Karzai               22   Hans Blix                  39
Hugo Chavez                71   Igor Ivanov                20   Jack Straw                 28
Jacques Chirac             52   Jean Chretien              55   Jennifer Aniston           21
Jennifer Capriati          42   Jennifer Lopez             21   Jeremy Greenstock          24
Jiang Zemin                20   John Ashcroft              53   John Negroponte            31
Jose Maria Aznar           23   Juan Carlos Ferrero        28   Junichiro Koizumi          60
Kofi Annan                 32   Laura Bush                 41   Lindsay Davenport          22
Lleyton Hewitt             41   Luiz Inacio Lula da Silva  48   Mahmoud Abbas              29
Megawati Sukarnoputri      33   Michael Bloomberg          20   Naomi Watts                22
Nestor Kirchner            37   Paul Bremer                20   Pete Sampras               22
Recep Tayyip Erdogan       30   Ricardo Lagos              27   Roh Moo-hyun               32
Rudolph Giuliani           26   Saddam Hussein             23   Serena Williams            52
Silvio Berlusconi          33   Tiger Woods                23   Tom Daschle                25
Tom Ridge                  33   Tony Blair                144   Vicente Fox                32
Vladimir Putin             49   Winona Ryder               24
mask = np.zeros(people.target.shape, dtype=bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1
X_people = people.data[mask]
y_people = people.target[mask]
# scale the grey-scale values to be between 0 and 1
# instead of 0 and 255 for better numeric stability:
X_people = X_people / 255.
NumberOfPeople = np.unique(people.target).shape[0]
TargetNames = []
n = 5
# show the first image of each person
fig, axes = plt.subplots(12, 5, figsize=(15, 30),
                         subplot_kw={'xticks': (), 'yticks': ()})
for target, ax in zip(np.unique(people.target), axes.ravel()):
    # get the indices of the first n pictures of each person
    indices = np.where(people.target == target)[0][:n]
    TargetNames.append(people.target_names[target])
    image = people.images[indices[0]]
    ax.imshow(image)
    ax.set_title(str(target) + ': ' + TargetNames[target])
Now apply a principal component analysis, X_pca = pca.fit_transform(X_people), and
extract the first 100 components of each image. Reconstruct the first 10 entries of the dataset
from the 100 components of the PCA-transformed data by applying the
pca.inverse_transform method and reshaping the image to its original size using
np.reshape.
What is the minimum number of components necessary such that you recognize the persons? Try it out.
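One way to approach this question quantitatively is to look at how the reconstruction error falls as n_components grows. The sketch below uses the scikit-learn digits dataset as a lightweight stand-in for the face images (so it runs without downloading LFW); the component counts are illustrative.

```python
# Sketch: reconstruction quality as a function of n_components
# (the digits dataset stands in for the face images).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data / 16.0  # scale pixel values to [0, 1]

errs = []
for n in [5, 20, 50]:
    pca = PCA(n_components=n, random_state=0)
    # project to n components and map back to pixel space
    X_rec = pca.inverse_transform(pca.fit_transform(X))
    errs.append(float(np.mean((X - X_rec) ** 2)))
    print(f"n_components={n:3d}  mean squared reconstruction error: {errs[-1]:.4f}")
```

The error decreases monotonically with n_components; visually, faces typically become recognizable well before the error reaches zero, which is what the "try it out" above asks you to judge.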
NumberOfPeople
62
# extract eigenfaces from the lfw data and transform the data
from sklearn.decomposition import PCA
pca = PCA(n_components=100, whiten=True, random_state=0)
X_pca = pca.fit_transform(X_people)
#X_pca = pca.transform(X_people)
image_shape = people.images[0].shape
NumberOfSamples = X_pca.shape[0]
fig, axes = plt.subplots(2, 5, figsize=(15, 8),
                         subplot_kw={'xticks': (), 'yticks': ()})
for ix, target, ax in zip(np.arange(NumberOfSamples), y_people, axes.ravel()):
    image = np.reshape(pca.inverse_transform(X_pca[ix, :]), image_shape)
    ax.imshow(image)
    ax.set_title(str(y_people[ix]) + ': ' + people.target_names[target])
Import the DBSCAN class from sklearn.cluster, create an instance called dbscan, apply it to the PCA-transformed data X_pca and extract the cluster labels using labels = dbscan.fit_predict(X_pca). First use the default parameters of the method and check how many distinct clusters the algorithm finds by counting the unique entries in the predicted cluster labels.
# apply DBSCAN with default parameters
from sklearn.cluster import DBSCAN
dbscan = DBSCAN()
labels = dbscan.fit_predict(X_pca)
print("Unique labels: {}".format(np.unique(labels)))
Unique labels: [-1]
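With the default eps every point ends up as noise (-1), so eps must be increased. Rather than guessing, a common heuristic is to inspect the sorted distance of each point to its k-th nearest neighbor (with k matching min_samples); a knee in that curve suggests a sensible eps. The sketch below illustrates the idea on synthetic blobs standing in for X_pca.

```python
# Sketch of the k-distance heuristic for choosing eps: sort each point's
# distance to its k-th nearest neighbor; a knee in this curve suggests eps.
# Synthetic blobs stand in for X_pca here.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

k = 3  # would match min_samples=3 in the DBSCAN runs below
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own 0-th neighbor
dists, _ = nn.kneighbors(X)
kdist = np.sort(dists[:, -1])

# most points have small k-distances; noise candidates sit on the steep tail
print("median k-distance:", round(float(np.median(kdist)), 3))
print("90th percentile:  ", round(float(np.percentile(kdist, 90)), 3))
```

Plotting kdist (e.g. plt.plot(kdist)) makes the knee visible; the eps scan in the next cell is the brute-force alternative used in this exercise.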
eps parameter
Change the eps parameter of the clustering, e.g. DBSCAN(min_samples=3, eps=5). Vary
eps in the range from 6 to 8 in small steps and check for
each value of eps how many clusters are found. Save the labels from the clustering that yields the largest number of clusters.
max_clust = 0
labels_max = []  # labels from the clustering with the largest number of clusters
for eps in np.linspace(6, 8, 51):
    print("\neps={}".format(eps))
    dbscan = DBSCAN(eps=eps, min_samples=3)
    labels = dbscan.fit_predict(X_pca)
    if max_clust < len(np.unique(labels)):
        max_clust = len(np.unique(labels))
        labels_max = labels
    print("Number of clusters: {}".format(len(np.unique(labels))))
    print("Cluster sizes: {}".format(np.bincount(labels + 1)))
eps=6.0   Number of clusters: 4   Cluster sizes: [2050 3 7 3]
eps=6.04  Number of clusters: 4   Cluster sizes: [2048 3 7 5]
eps=6.08  Number of clusters: 4   Cluster sizes: [2048 3 7 5]
eps=6.12  Number of clusters: 4   Cluster sizes: [2048 3 7 5]
eps=6.16  Number of clusters: 5   Cluster sizes: [2045 3 7 3 5]
eps=6.2   Number of clusters: 6   Cluster sizes: [2042 3 7 3 3 5]
eps=6.24  Number of clusters: 6   Cluster sizes: [2042 3 7 3 3 5]
eps=6.28  Number of clusters: 6   Cluster sizes: [2042 3 7 3 3 5]
eps=6.32  Number of clusters: 6   Cluster sizes: [2042 3 7 3 3 5]
eps=6.36  Number of clusters: 6   Cluster sizes: [2042 3 7 3 3 5]
eps=6.4   Number of clusters: 6   Cluster sizes: [2041 3 7 3 6 3]
eps=6.44  Number of clusters: 7   Cluster sizes: [2036 3 4 7 6 3 4]
eps=6.48  Number of clusters: 7   Cluster sizes: [2032 3 5 7 7 6 3]
eps=6.52  Number of clusters: 7   Cluster sizes: [2030 3 7 7 7 6 3]
eps=6.56  Number of clusters: 7   Cluster sizes: [2028 3 8 7 8 6 3]
eps=6.6   Number of clusters: 7   Cluster sizes: [2026 3 6 8 10 7 3]
eps=6.64  Number of clusters: 8   Cluster sizes: [2018 4 13 8 10 4 3 3]
eps=6.68  Number of clusters: 7   Cluster sizes: [2010 4 13 25 5 3 3]
eps=6.72  Number of clusters: 7   Cluster sizes: [2009 4 13 26 5 3 3]
eps=6.76  Number of clusters: 6   Cluster sizes: [2005 36 13 3 3 3]
eps=6.8   Number of clusters: 7   Cluster sizes: [1999 38 14 3 3 3 3]
eps=6.84  Number of clusters: 6   Cluster sizes: [1991 49 14 3 3 3]
eps=6.88  Number of clusters: 8   Cluster sizes: [1985 49 14 3 3 3 3 3]
eps=6.92  Number of clusters: 7   Cluster sizes: [1975 61 14 3 4 3 3]
eps=6.96  Number of clusters: 8   Cluster sizes: [1962 68 4 14 6 3 3 3]
eps=7.0   Number of clusters: 7   Cluster sizes: [1954 77 7 4 14 4 3]
eps=7.04  Number of clusters: 9   Cluster sizes: [1943 82 7 4 14 3 4 3 3]
eps=7.08  Number of clusters: 10  Cluster sizes: [1929 93 7 4 14 3 4 3 3 3]
eps=7.12  Number of clusters: 7   Cluster sizes: [1924 118 7 3 4 4 3]
eps=7.16  Number of clusters: 6   Cluster sizes: [1916 134 3 3 4 3]
eps=7.2   Number of clusters: 4   Cluster sizes: [1909 147 3 4]
eps=7.24  Number of clusters: 5   Cluster sizes: [1893 158 3 5 4]
eps=7.28  Number of clusters: 4   Cluster sizes: [1887 165 7 4]
eps=7.32  Number of clusters: 4   Cluster sizes: [1877 174 7 5]
eps=7.36  Number of clusters: 5   Cluster sizes: [1863 185 3 7 5]
eps=7.4   Number of clusters: 4   Cluster sizes: [1857 196 3 7]
eps=7.44  Number of clusters: 6   Cluster sizes: [1842 205 3 3 7 3]
eps=7.48  Number of clusters: 6   Cluster sizes: [1828 219 3 3 7 3]
eps=7.52  Number of clusters: 4   Cluster sizes: [1817 240 3 3]
eps=7.56  Number of clusters: 5   Cluster sizes: [1799 254 3 3 4]
eps=7.6   Number of clusters: 5   Cluster sizes: [1788 265 3 3 4]
eps=7.64  Number of clusters: 4   Cluster sizes: [1774 283 3 3]
eps=7.68  Number of clusters: 4   Cluster sizes: [1755 302 3 3]
eps=7.72  Number of clusters: 5   Cluster sizes: [1746 308 3 3 3]
eps=7.76  Number of clusters: 4   Cluster sizes: [1739 318 3 3]
eps=7.8   Number of clusters: 5   Cluster sizes: [1722 330 4 3 4]
eps=7.84  Number of clusters: 5   Cluster sizes: [1701 351 4 3 4]
eps=7.88  Number of clusters: 5   Cluster sizes: [1691 361 4 3 4]
eps=7.92  Number of clusters: 4   Cluster sizes: [1684 372 3 4]
eps=7.96  Number of clusters: 4   Cluster sizes: [1672 384 4 3]
eps=8.0   Number of clusters: 3   Cluster sizes: [1655 405 3]
10
Plot the members of the clusters with 10 or fewer samples from the clustering with the largest number of clusters, using the following Python code.
# the labels from the clustering with the largest number of clusters
labels = labels_max
for cluster in range(max(labels) + 1):
    mask = labels == cluster
    n_images = np.sum(mask)
    print("Cluster number: {}".format(cluster))
    print("Cluster size: {}".format(n_images))
    if n_images < 11:
        fig, axes = plt.subplots(1, n_images, figsize=(n_images * 1.5, 4),
                                 subplot_kw={'xticks': (), 'yticks': ()})
        for image, label, ax in zip(X_people[mask], y_people[mask], axes):
            ax.imshow(image.reshape(image_shape))
            ax.set_title(people.target_names[label].split()[-1])
Cluster number: 0  Cluster size: 93
Cluster number: 1  Cluster size: 7
Cluster number: 2  Cluster size: 4
Cluster number: 3  Cluster size: 14
Cluster number: 4  Cluster size: 3
Cluster number: 5  Cluster size: 4
Cluster number: 6  Cluster size: 3
Cluster number: 7  Cluster size: 3
Cluster number: 8  Cluster size: 3
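Although the exercise treats the names as unknown, y_people is actually available, so the quality of a clustering can be quantified, e.g. with the adjusted Rand index. The sketch below shows the idea on synthetic blobs with known labels standing in for (X_pca, y_people); the centers and DBSCAN parameters are illustrative.

```python
# Sketch: quantifying cluster/identity agreement with the adjusted Rand index.
# Synthetic blobs with known labels stand in for (X_pca, y_people).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                       cluster_std=0.8, random_state=0)
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# score only the points DBSCAN did not mark as noise (-1)
core = labels != -1
ari = adjusted_rand_score(y_true[core], labels[core])
print(f"ARI on non-noise points: {ari:.2f}")  # 1.0 means perfect agreement
```

Applied to the face data, the same adjusted_rand_score(y_people[core], labels[core]) call would show how closely the DBSCAN clusters above track the true identities.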
# %% apply other clustering algorithms to the PCA-transformed data
from time import time
from sklearn import cluster
from sklearn.neighbors import kneighbors_graph

n_clusters = 14
clustering_names = ['SpectralClustering', 'Ward', 'AverageLinkage']
connectivity = kneighbors_graph(X_pca, n_neighbors=n_clusters, include_self=False)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)
spectral = cluster.SpectralClustering(n_clusters=n_clusters,
                                      eigen_solver='arpack',
                                      affinity="nearest_neighbors")
ward = cluster.AgglomerativeClustering(n_clusters=n_clusters, linkage='ward',
                                       connectivity=connectivity)
# note: `affinity` is deprecated in scikit-learn >= 1.2; use `metric` instead
average_linkage = cluster.AgglomerativeClustering(
    linkage="average", affinity="cityblock", n_clusters=n_clusters,
    connectivity=connectivity)
clustering_algorithms = [spectral, ward, average_linkage]
# %matplotlib inline
for name, algorithm in zip(clustering_names, clustering_algorithms):
    # fit and get cluster memberships
    print(algorithm)
    t0 = time()
    algorithm.fit(X_pca)
    t1 = time()
    if hasattr(algorithm, 'labels_'):
        labels = algorithm.labels_.astype(int)
    else:
        labels = algorithm.predict(X_pca)
    print("%s: %.2g sec" % (name, t1 - t0))
    print('labels found: %i' % (max(labels) + 1))
    print("_____________________________________________")
    print("               %s " % (name))
    print("_____________________________________________")
    # use `cluster_id`, not `cluster`, to avoid shadowing the sklearn module
    for cluster_id in range(max(labels) + 1):
        mask = labels == cluster_id
        ind = np.where(mask)[0]
        n_images = np.size(ind)
        submask = np.zeros(X_pca.shape[0], dtype=bool)
        submask[ind] = True
        max_image = np.min([n_images, 8])  # show at most 8 images per cluster
        print('max image: %i\n' % (max_image))
        fig, axes = plt.subplots(1, max_image, figsize=(max_image * 3, 3),
                                 subplot_kw={'xticks': (), 'yticks': ()})
        if max_image == 1:
            print(ind[0])
            image = X_people[ind[0]]
            label = y_people[ind[0]]
            plt.imshow(image.reshape(image_shape))
            plt.title(people.target_names[label].split()[-1])
        else:
            for image, label, ax in zip(X_people[submask], y_people[submask], axes):
                ax.imshow(image.reshape(image_shape))
                ax.set_title(people.target_names[label].split()[-1])
        plt.show()
SpectralClustering(affinity='nearest_neighbors', eigen_solver='arpack',
n_clusters=14)
SpectralClustering: 3 sec
labels found: 14
_____________________________________________
SpectralClustering
_____________________________________________
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 7
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
AgglomerativeClustering(connectivity=<2063x2063 sparse matrix of type '<class 'numpy.float64'>'
with 54374 stored elements in Compressed Sparse Row format>,
n_clusters=14)
Ward: 1.1 sec
labels found: 14
_____________________________________________
Ward
_____________________________________________
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
AgglomerativeClustering(affinity='cityblock',
connectivity=<2063x2063 sparse matrix of type '<class 'numpy.float64'>'
with 54374 stored elements in Compressed Sparse Row format>,
linkage='average', n_clusters=14)
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_agglomerative.py:983: FutureWarning: Attribute `affinity` was deprecated in version 1.2 and will be removed in 1.4. Use `metric` instead warnings.warn(
AverageLinkage: 6.4 sec
labels found: 14
_____________________________________________
AverageLinkage
_____________________________________________
max image: 8
max image: 1 1989
max image: 2
max image: 1 1606
max image: 1 1496
max image: 1 1713
max image: 1 1881
max image: 1 627
max image: 1 1219
max image: 1 661
max image: 1 63
max image: 1 1543
max image: 1 1507
max image: 1 1090